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Abstract 

Missing data are unavoidable in environmental epidemiologic surveys. The aim of this study was to compare methods for 
handling large amounts of missing values: omission of missing values, single and multiple imputations (through linear 
regression or partial least squares regression), and a fully Bayesian approach. These methods were applied to the PARIS birth 
cohort, where indoor domestic pollutant measurements were performed in a random sample of babies' dwellings. A 
simulation study was conducted to assess performances of different approaches with a high proportion of missing values 
(from 50% to 95%). Different simulation scenarios were carried out, controlling the true value of the association (odds ratio 
of 1.0, 1.2, and 1.4), and varying the health outcome prevalence. When a large amount of data is missing, omitting these 
missing data reduced statistical power and inflated standard errors, which affected the significance of the association. Single 
imputation underestimated the variability, and considerably increased risk of type I error. All approaches were conservative, 
except the Bayesian joint model. In the case of a common health outcome, the fully Bayesian approach is the most efficient 
approach (low root mean square error, reasonable type I error, and high statistical power). Nevertheless for a less prevalent 
event, the type I error is increased and the statistical power is reduced. The estimated posterior distribution of the OR is 
useful to refine the conclusion. Among the methods handling missing values, no approach is absolutely the best but when 
usual approaches (e.g. single imputation) are not sufficient, joint modelling approach of missing process and health 
association is more efficient when large amounts of data are missing. 
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Introduction 

In epidemiological studies, accurate estimate of exposure is very 
important for assessment of health risk. However missing data are 
often unavoidable, resulting from loss to follow-up in longitudinal 
studies, or non-responses in questionnaires. In large scale studies, 
certain strategies have been developed reducing the high cost of 
environmental measurements, e.g. collecting exposure surrogates 
for all subjects generally by questionnaires, and performing exact 
personal or environmental measurements only in subsamples of 
population [1]; most of the time these subsamples are small due to 
economic and logistic reasons (high cost, noise, bulk samplers...). 
For instance, in the National health and nutrition examination 
survey (NHANES), personal exposure to volatile organic com- 
pounds was assessed in a subsample of 851 adults, i.e. 8.5% of the 
population study [2]. Furthermore, even these specific studies can 
suffer from missing data due to measuring instrument failure, 
routine maintenance of monitors, and human error. Whatever the 
reasons for incomplete data, it can be a significant obstacle for 
researchers. 

Most statistical software omits records with missing values by 
default, and analysis is conducted on a subset with the available 
data. This approach is commonly used for handling missing data, 



but can lead to loss of statistical power which can be problematic 
in environmental surveys where associations between environ- 
mental factors and health outcome are generally weak. Further- 
more, the results of the complete cases analyses are imprecise, 
given that part of the data is not considered. 

An alternative is to use measurements issued from subsamples to 
build predictive models and then, apply them to the whole 
population. Among these imputation techniques, single imputation 
approach is the most common and easily conducted as standard 
methodology [3], it involves a single estimated value for each 
missing value. It can be applied directly without loss of power due 
to the sample size being brought back to its original size. However, 
single imputation ignores the uncertainty of estimation due to the 
imputation. 

Consequendy, in 1987, Rubin proposed multiple imputation 
[4], where each missing value is imputed by multiple simulated 
data leading to multiple "completed datasets". Each generated 
dataset is analyzed by standard methodology and the results 
combined, enabling the uncertainty attached to missing data to be 
assessed. Whilst several authors have declared multiple imputation 
their method of choice [5,6], a recent review has suggested that its 
use is still quite rare: less than 2% of papers published in 
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epidemiology journals have used multiple imputation and often 
omitting important details of the methodology used [7] . 

Another more recent alternative for dealing with missing data is 
to jointly model through a Bayesian approach the missing process 
and the association between health outcome and covariates [8,9]. 
This kind of models is a part of more general approaches referred 
to hierarchical Bayesian modelling [10] which combines several 
sub-models. In our case, there are two sub-models where the first 
one connects missing exposure and predictive factors, and the 
second one assesses the association between health outcome and 
covariates including missing exposure. These two sub-models 
could be implemented separately but the estimated exposure by 
the first sub-model would be used in the second sub-model of 
disease as if it had been measured without uncertainty. Through 
the Bayesian modelling, the sub-models are integrated together 
allowing to take into account all uncertainty. 

The aim of this paper is to examine performances of several 
approaches for handling large amounts of missing data in 
environmental epidemiological surveys when the data are missing 
completely at random (MCAR) using both a case study and a 
simulation study. Omitting missing values, imputation techniques 
(single and multiple imputation) and fully Bayesian approach are 
considered. 

Materials and Methods 

In order to compare results from methods for handling missing 
values, a real dataset was used. In the PARIS (Pollution and 
Asthma Risk: an Infant Study) cohort, measurements were 
performed and pollutant levels, such as formaldehyde levels, are 
available in a subset of the population, but missing for infants not 
involved in the environmental investigation [1 1]. 

Data 

The cohort enrolled 3 840 full-term healthy babies, recruited 
between February 2003 and June 2006. The study protocol is 
described elsewhere [12]. At birth, an interview with the mother 
was conducted to collect data about the history of allergic 
conditions in both parents. Gender, parity, anthropometric 
parameters of the child, maternal history of pregnancy and 
delivery were also registered from newborn's and mother's medical 
records. Parents regularly documented health outcomes in mailed 
questionnaires. The health outcomes of interest were defined by 
the occurrence of lower respiratory infection (LRI) and a dry night 
cough (DNC) during the first twelve months of life [11,13]. 

Concerning environmental and lifestyle data, a trained inter- 
viewer interviewed parents by phone during maternity leave to 
describe in detail home characteristics (construction date, number 
of occupants, home surface area, heating and cooking appliances, 
presence of mechanical ventilation and double glazing, wall and 
floor coverings and signs of dampness) and family living conditions 
(duration of breastfeeding, information on day-care attendance, 
keeping of pets, aeration, smoking, use of air fresheners, do-it- 
yourself (DIY)). Any changes were assessed by mailed question- 
naires at each time points. 

Aldehyde air sampling measurements were performed in a 
random sample of 196 babies'dwellings using a passive sampler 
[14]. Predictive factors of formaldehyde levels were previously 
identified: sources (presence and age of wall coating, wood-pressed 
products for flooring or varnished parquet floor, and particle 
board furniture), parameters of aeration and air stuffiness (length 
of window opening, presence of mechanical ventilation and double 
glazing), and home characteristics (construction date, housing 
area, and number of occupants) [11]. 



This study was approved by the National Ethics Committee 
(permissions 031153 and 051289), and parents of participating 
infants gave their written informed consent. Data are stored in the 
Paris council, within its Social, Childhood and Health Direction 
(DASES). All data were anonymized before statistical analysis. 

Association between formaldehyde, including missing 
values, and health outcomes in the PARIS cohort 

In the context of the study of formaldehyde exposure impact 
(variable including missing values) on LRI (a relative common 
health outcome in infancy) or DNC (a less prevalent event), results 
from methods for handling missing values were compared. All 
infants who move during the first year of life and those with no 
data on health outcome were excluded from the analyses. 

Unmeasured formaldehyde values are assumed MCAR [15] 
since families where measurements were carried out were selected 
at random, values are missing by design. The methods for dealing 
with missing data that we considered were: omitting missing data, 
imputation methodologies, and a fully Bayesian approach. For the 
first one, as its name implies, only cases with available information 
are considered, with cases with missing data being discarded. 
Concerning the imputation approach, missing data are imputed 
from the available information. Whatever the choice between 
single or multiple, an imputation model has to be established. Two 
approaches imputing missing formaldehyde values were exam- 
ined, the linear regression model (LM), and the partial least 
squares (PLS) model. PLS regression is particularly suited when 
there are more predictors variables than observations, and 
contrary to LM, it allows multicolinearity between variables 
[16]. PLS method is based on the reduction of predictor variables 
dimension by using techniques near principal component analysis. 
As recommended in the literature, the imputation model includes 
variables that are used in subsequent analyses such as the outcome 
[17,18]. Therefore, the imputation model included formaldehyde 
predictive factors and LRI or DNC. The predicted formaldehyde 
mean was used in the single imputation. Note that other 
approaches exist for single imputation where, for instance, the 
missing value is replaced by local or adaptative estimate [19,20,2 1] 
and not by the global mean. Whatever the chosen technique, the 
missing value is always replaced by a single value underestimating 
the variability due to this estimation. As the health outcomes are 
binary variables, the association between formaldehyde levels and 
LRI or DNC was then examined using logistic regression whatever 
the imputed model approach. 

In the multiple imputation approach, several imputations are 
generated for given missing data. As previously described by Little 
and Rubin [22], "m completed datasets" were firstly created by 
filling in the missing values through the imputation model: missing 
formaldehyde values were imputed randomly from an approxi- 
mate predictive distribution based on the fitted regression. For 
example in the case of LM, regression coefficients were sampled 
from their multivariate Gaussian distribution obtained on 
observed data and then missing formaldehyde values were 
replaced by their corresponding predicted values. This procedure 
was repeated m times. Here 10 000 imputed datasets were fitted. 
The completed datasets were analyzed separately, the association 
between formaldehyde levels (observed and imputed) and health 
was examined using logistic regression, and the results of all 
datasets were then combined, applying Rubin's rules, to yield final 
inference on the parameters of interest. The variance for the 
combined parameter estimates included between and within 
imputation variation. 

Finally, a fully Bayesian model was implemented, as suggested 
by Carpenter and Kenward [18]. Two sub-models were fitted 
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joindy using Markov Chain Monte Carlo methods [17]. The first 
one modelled the association between health indicator and 
exposure, and the second one modelled the relation between 
missing and observed exposure measurements including covariates 
supposed to be linked to exposure. Such joint modelling has 
advantages as mutually enhanced estimates precision of two sub- 
models parameters, extending multiple imputation methodology. 
The algorithm was run for 10 000 iterations with 1000 iterations 
discarded for burn-in. Inspection of posterior time series plots for 
the parameters as well as autocorrelation plots indicated that the 
model mixed well. For each model, posterior mean of OR with 
95% credibility interval (95% Cr) is shown for the formaldehyde 
exposure. Note that since the PLS approach is not based on a 
probability model, Bayesian modelling cannot be used. 

Simulation study 

Facing missing data, the choice of approach is crucial in terms 
of the conclusion, particularly when there is a high proportion of 
missing data (near 94% in our case), and weak associations. 
Comparisons between approaches have to be based on the quality 
of estimates, and on the ability to conclude or not a significant 
association. Simulation studies with characteristics near those of 
real data but controlling the true value of OR without omitting the 
case of no association (OR= 1.0) were therefore conducted. Two 
cases were considered: one frequent outcome similar to LRI 
(named "event 1"), and a second case close to DNC ("event 2"). 
Sample sizes are similar to those in real data sets (n = 2 551 for 
event 1 and n = 2 342 for event 2). 

Datasets were simulated from the following steps: In E ~ N(X 
<p; a 2 Id) and F, ~ B( %i) with logit( rcj = a. + /J £, + y Z iy fi = In 
OR true and i = 1, n, and where formaldehyde factors are 
denoted by X, exposure variable by E, covariates by Z and health 
indicator by F. Exposure variable (corresponding to formaldehyde 
in real dataset) was simulated on a logarithmic scale from a linear 
model with formaldehyde predictors obtained from real data [1 1], 
and coefficients (p were equal to those estimated in this study. 
Then, the health indicator was simulated from a logistic model 
with the resulting formaldehyde levels and covariates from real 
data (coefficients y associated with covariates were those estimated 
on real data as well as the residual variance a' 2 ). Formaldehyde 
predictors and covariates are given in Table SI. 

Three different OR trul , (and then three p trul! = In OR true ) between 
formaldehyde and event 1 or event 2 were considered: 1.0, 1.2, 
and 1.4. Missing values for formaldehyde were assigned at 
random. A case of 95% missing values was considered. To assess 
the robustness of our conclusions, simulations with different 
missing values percentages (85%, 75% and 50%) were also 
conducted. For each scenario, a total of 100 datasets were 
generated. This number of simulations is required to obtain an 
estimate of the regression coefficient associated with formaldehyde 
exposure within 10% of its true value when missing values are 
omitted in the real data. Indeed, the equation given by Burton 

et al. [2 3] for the number of simulations B is B = I — j 2 *- I here 5 is 

the specified level of accuracy of the estimate of interest 
"accepted", a, the standard error for the parameter of interest 
and, Z]_r,j2, the (l-a/2) quantile of the standard normal distribu- 
tion. When B — 100, a. — 0.05 and a 2 = 0.26 (estimated variance on 
real data set), <5 is equal to 10%. 

The quality of estimates for the different approaches was 
assessed by the root mean square error of beta coefficients (RMSE, 

/ ^ ^ 
RMSE = E (ft - Ptmef where ft is estimated p on i dataset 
V i=l 



(i=l,...,m) and CI 95% (RMSE) = [s/MSE M ; ^/MSE sup ] where 
MSE mf = MSE- 1.96^p> and MSE sup = MSE + 1.96 



and MSE = RMSE 2 ). The proportion of 



sd{MSE) 

associations (PS 



PS-- 



y\ — 



where 1 



'significant" 
1, if 



\4CI(OR,) 

= 0, otherwise) i.e. of confidence or 



1 £CI(ORi)= \ and 1 

r ' \4CI(OR<) 

credibility interval of OR excluding 1 was also calculated. This 
criterion assesses risk of type I error when OR trur = 1, and statistical 
power when OR true was different from 1 . 

Confidence intervals for RMSE were based on a Gaussian 
approximation using the empirical standard deviation of the MSE, 
and confidence intervals for the PS were based on the Clopper- 
Pearson "exact" confidence intervals [24] avoiding the Gaussian 
approximation. As the question is basically if formaldehyde 
exposure increased LRI and DNC risk according to previous 
studies in literature, the one-sided approach was chosen. Statistical 
analyses were conducted using R 2.14.0 [25] and WinBUGS 
software [26]. 

Results 

Results on real data 

In this study, 2 55 1 infants of the PARIS birth cohort were 
completely observed, independently of health outcomes and 
formaldehyde levels, and pollutant levels were available for 142 
of them. Most of infants lived in apartments. Nearly 30% of 
buildings were built after 1975, and two-thirds were equipped with 
double glazing. Around half of babies had wood-pressed products 
for flooring or varnished parquet floor in their bedroom. Recent 
(less than one year old) particle board furniture was present in 
49.8% dwellings. About half of infants (46.9%) had at least one 
LRI during their first year of life and the prevalence of DNC was 
14.9%. 

OR estimates between LRI or DNC occurrence and formal- 
dehyde exposure levels are given in Table 1. The single 
imputation techniques (LM or PLS) clearly induced high estimates 
of association compared to all other approaches. The estimated 
OR with fully Bayesian approach was similar to that obtained with 
multiple imputation particularly for LRI, but lower than that with 
PLS imputation. However, intervals were different leading to a 
significant association with Bayesian modelling for LRI and nearly 
significant for DNC but not significant with both multiple 
imputation. 

Simulations results 

From the 100 simulated datasets, the resulting mean prevalence 
of event 1 was 29.6% (range: 18.4%— 41.3%) and of event 2 was 
12.3% (range: 6.8%-18.5%). Table 2 shows RMSE and PS 
assessed on simulated data when no data are missing. Results were 
quasi similar for all approaches. As RMSE depends on OR true and 
on prevalence of event, it increased between events 1 and 2 and 
slightly with OR trnl! . Concerning event 1, frequentist and Bayesian 
approaches always concluded a significant association when 
OR true =lA while when OR true = 1 .2, statistic power was near 
60%. For an infrequent event as event 2, statistical power 
decreased being near 85% when OR trm , = 1.4 and near 40% when 
OR true = 1.2. When OR t ,-u e = 1.0, PS had to be equal to 5%, which 
was the case for the two approaches even if there was a slight 
increase for the less frequent event. These results can be 
considered as reference results as no data are missing. 
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Table 1. Associations between environmental risk factor 3 including missing values, and health outcomes, by different methods 
handling missing values (OR [95% CI or 95% Cr]). 









LRI b 


DNC C 


Na omitted 




1.11 [0.55, +oo) 


1.31 [0.45, +oo) 


Single imputation 


LM 


1.91 [1.53, +oo) 


5.63 [3.69, +=o) 




PLS d 


3.27 [1.61, +oo) 


3.69 [2.57, +oo) 


Multiple imputation 


LM 


1.28 [0.91, +oo) 


1.35 [0.14, +oo) 




PLS d 


2.81 [0.35, +oo) 


2.69 [0.39, +00) 


Bayesian approach 




1.27 [1.10, +oo) 


1.16 [0.95, +oo) 



Abbreviations: CI, 95% confidence interval; Cr, 95% credibility interval, LM, linear regression model; OR, odds ratio; PLS, partial least squares. 

a : Environmental factor: formaldehyde exposure, expressed in ug/m 3 , (for one unit increase in the logarithmic scale), and health outcome: lower respiratory infection 
(LRI) or dry nigh cough (DNC). 

b : Association was adjusted for gender, socio-economic status, siblings, parental history of asthma, breastfeeding, daycare attendance, pre/postnatal tobacco smoke 
exposure, sign(s) of dampness, and presence of pets at home. 

c : Association was adjusted for gender, socio-economic status, parental history of allergy, breastfeeding, pre/postnatal tobacco smoke exposure, gas heating, 
cockroaches, infant's mattress age, family events, and number of episodes of lower respiratory infections. 
d : PLS imputation with two components. 
doi:1 0.1 371 /journal.pone.01 04254.t001 



Table 3 provides results of the RMSE on replicates with 95% of 
missing values. RMSE values ranged from 0. 18 to 0.83 for event 1, 
and from 0.09 to 1.06 for event 2. For each event, RMSE slightly 
increased with OR. RMSE values were always lower in multiple 
imputation than in single imputation, whatever the imputation 
model (LM or PLS). Single imputation led to huge RMSE 
reflecting poor qualities of estimates. All results were confirmed 
with proportion of missing values of 75% and 85% (Table S2). 

Figure SI shows boxplots of beta estimates obtained from 
Bayesian approach, from 100 simulations for the three different 
values of 0R true . When OR true is equal to 1.0, the range of 
estimates increased with the proportion of missing values and with 
the decrease of event prevalence. These results were also observed 
when OR irue = 1.2 and 0R trm , = 1.4. Moreover, an underestima- 
tion of beta (and so of OR) was obtained when OR true = 1.4. 

When ORtrue is equal to 1 , real risk of type I error is assessed by 
PS while the theoretical one was fixed at 5% (Table S3). For event 
1 with 95 % of missing data, single imputation led to very high risks 
(23% [15.2, 32.5] and 21% [13.5, 30.3] for LM and PLS, 
respectively). If missing proportion is 75% or 85%, huge risks of 
type I error were again found for single imputation. For event 2, 
PS considerably increased for single imputation, 35% [25.7, 45.2] 
and 42% [32.2, 52.3] for LM and PLS, respectively. This increase 
was confirmed with 75% and 85% of missing values. Multiple 
imputation led to always conservative results explained by large 
confidence intervals of OR obtained with this approach whether 
for event 1 or 2. Bayesian approach led to increase risk of type I 
error. For event 1, this increase seems reasonable because 5% is 
always in the confidence interval (7% [2.9, 13.9] with 95% missing 
values). For the infrequent event 2, risk of type I error increases 
and excluding 5% from confidence intervals when 85% and 95% 
of data are missing (e.g. 16% [9.4, 24.7] with 95% missing values). 

Figure 1 presents PS when 0R trul! is greater than 1 and 95% of 
missing values for two events and for all approaches, excluding 
single imputation which had given a weak quality on estimates and 
overestimated risk of type I error. As expected, statistical power 
increased with 0R true and decreased for infrequent event. Weak 
performances were obtained for multiple linear and PLS 
imputation. This figure clearly shows highest PS for Bayesian 
model near reference power on complete data (Table 1) especially 
for event 1 , all other approaches giving a null or quasi null power 
for ORtnu,— 1.2 and OR trm! = 1.4, respectively. Figure S2 presents 



PS when OR true is greater than 1 for 85% and 75% of missing 
values. Highest PS were clearly obtained for Bayesian model. 

And with 50% of missing data, statistical power increased 
remaining the best for Bayesian approach especially for infrequent 
event (near 80% by Bayesian approach against 53% by LM 
multiple imputation when 0R tnle = 1.4). 

Discussion 

This paper addresses the crucial question of how to handle large 
amounts of missing data in environmental epidemiological surveys. 
Till now, as far as we know, very few teams have compared 
performances of approaches handling missing values with a 
proportion of missingness above 75% [27,28]. To solve this 
question, we used both real and simulated data to determine the 
most appropriate approach when there is a large amount of 
missing data. Results on RMSE and PS showed poor perfor- 
mances with single imputation. Fully Bayesian approach seems 
better, followed by imputation approaches, which in turn gave 
better results than omitting missing observations. 

As expected, we found that omitting missing values is less 
efficient than single and multiple imputations. Even if it is the 
easiest approach for handling missing values, it should be used 
only in presence of less than 5% of missing values [29] because it 
should induce a significant loss of statistical power: unrealistic 
when the health outcome is infrequent and problematic in 
environmental studies where pollutants have often a weak impact. 

Another common approach for handling missing values is to 
impute them before any analysis which is commonly used in 
environmental epidemiological studies to estimate exposure levels 
for all study members. An increase in power is the substantial 
benefit of this alternative over omitting missing values. Neverthe- 
less, the specification of the imputation model is an important step. 
As previously demonstrated, if the imputation model is not 
properly specified the imputation approach could introduce a bias 
which is not present in omitting missing values when missingness is 
MCAR [28]. It has been observed that including many relevant 
variables in the imputation model tends to make the missing at 
random assumption more plausible [30], even if computational 
problems could occur such as multicollrnearity and a large number 
of predictors might provide instable estimates. Previous authors 
showed the importance of including the health outcome because 



PLOS ONE | www.plosone.org 



4 



September 2014 | Volume 9 | Issue 9 | e1 04254 



Missing Data in Environmental Epidemiology 



n 
cc 
O 



n 
cc 
O 



rN rs >— ' >— < 



O <- i- 



N N ^ ^ 



















d 


d 


CO 


d 


OS" 


CX? 


MD 


r*- 


o 


o 


rC 




d 


d 


CO 


'OS] 


o 


o 










d 


d 


L/1 


vD 



° *~ i- 



O ,- r- 



i_ fC i_ 



Q_ 

S : 



o II 

of 2 
O 

o £ 

t > 

a; <u 
<d 

c Ln 

03 ~D [M 

c o 

^ D O 

8 £ si 



TO 



i m 

vj *": 

a; "5. ° 

-Q E 

-g !s o 

< i/l ~o 



IT) 



Q. 

CLI 



o 
o 



0) 

y 

O 



=3 



<1J 

E 

o 
o 
cc 

M 

91 

Si 
IS 



CO 




m 






vo 




q 




m 


CO 


rM 


d 






d 


d 


d 


u-i 


IT) 


rsi 












CO 


rM 






d 


d 


d 


d 


d 


d 
















O 




fN 






d 




d 


d 


d 


d 



o 


m 








r-. 




q 




m 




rM 


d 




d 


d 


d 


d 


CO 


cv 


o 


vo 


!*». 


o 


o 


00 




rM 


LA 


rM 


d 


d 


d 


d 


d 


d 


av 


vo 




OV 






o 


cv 


CO 






rM 


d 


d 


d 


d 


d 


d 



\0 LO CT\ 



I Q_ I 



a 
£ 



a 
E 



o 

a 
a 



m 



0J rsi 



»/) II 
a. ~Z. 



— - LTt 



m £ 



5 o 

< T3 



PLOS ONE | www.plosone.org 



5 



September 2014 | Volume 9 | Issue 9 | e1 04254 



Missing Data in Environmental Epidemiology 



100 n 



90 



70 



60 



~ 50 



S 40 



■ 30 



OR=1.2 




OR=1.4 









1 




> 



Figure 1. Proportions of significant associations based on 100 replicates, for each approach dealing with 95% of missing 
values 
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Figure 1. Proportions of significant associations based on 100 replicates, for each approach dealing with 95% missing values. 

doi:10.1371/journal.pone.0104254.g001 



regression coefficients came close to the truth [31,32]. Note that 
inclusion of the health outcome implies that the imputation 
procedure has to be renewed for each new health outcome. 

Single imputation is easy to implement, but the major 
disadvantage is the overestimation of association between exposure 
and health outcome increasing with the strength of the association. 
As explained by Rubin [4], this overstatement is certainly due to 
that one imputed value cannot itself represent uncertainty about 
imputed value. 

Conversely, multiple imputation takes into account the uncer- 
tainty and thus, does not underestimate the variance of estimates. 
But this approach is conservative. In fact, as previously described 
in the literature [31], large imprecision of OR estimates was 
observed, thus yielding to no significant associations. Moreover, 
concerning the choice of the number of imputations, it has been 
suggested [4], that less than 10 imputed datasets are useful 
compared with an infinite number of imputations. However, when 
the percentage of missing values is huge, more than 10 imputations 
may be needed and the number of imputed data should 
approximate the number of observations with missing data, as 
previously suggested [33]. Although 2 000 imputations could have 
been sufficient in this study, we fitted 1 0 000 imputed datasets as a 
precaution even if it was time consuming. Bayesian joint modelling 
appears to be less conservative with a statistical power near 75% 
for event 1 (near 55% for the infrequent event 2) when 



ORtrue— 1.4 with 95% of missing data. Nevertheless, it is 
important to notice that risk of type I error tends to increase, 
slightly for event 1 (near 7%) and much more for the infrequent 
event 2 (near 16%). Even if the RMSE is always smaller for the 
Bayesian approach compared to the other approaches and 
indicates a better performance in terms of estimates, boxplots of 
the beta estimates from 100 simulations under OR bull = 1.0 clearly 
show that the range of estimates increases with decrease of event 
prevalence and with increase in the missing values proportion. 
This result is confirmed when OR is greater than one but a bias 
appears when OR true increases. Thus, caution should be taken 
when interpreting results for an infrequent event with high 
proportion of missing values. In addition, the use of empirical 
approaches (e.g. bootstrap, Monte Carlo study) could be useful to 
assess the real risk of type I error. 

For infrequent events (for instance with prevalence less than 
10%) and high proportion of missing values, statistical power 
remains too weak. Bayesian approach offers the possibility to 
obtain easily estimated posterior distribution of OR which could 
be a useful tool to refine conclusions. Posterior probability of OR 
being smaller than 1 can indeed be deduced and such probability 
between 5% and 10% could be considered weak enough to 
conclude an "almost significant" association. This strategy would 
lead to three possible conclusions: "not significant", "significant" 
or "almost significant". Thus, for event 2, the classical approach 
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yields 46 non significant associations when 0R trlle = 1 .4, but 
among them, 7 would be declared as "almost significant". It is 
noteworthy that the proportion of non significant associations now 
labeled "almost significant" increases with OR trul , from 8.8% 
(OR true = 1.2) to 15.2% (OR true = 1.4). If the type I error is assessed 
only among associations not declared "almost significant", it 
remains stable (16% and 17.8% for classical and this new strategy, 
respectively). 

In conclusion, among the methods dealing with missing data, no 
approach is absolutely better than the others in all circumstances. 
In the presence of high proportions of missing values, using only 
complete data yields to a significant loss of statistical power. Single 
imputation underestimates the variance, thus overestimating the 
association between environmental factor and health outcome. 
Multiple imputation, due to overcoverage, is too conservative and 
unable to show significant associations. When the health outcome 
is frequent, joint modelling seems to be more efficient than other 
approaches, combining low RMSE, limited increase of risk of type 
I error, and high statistical power. The simulation study is useful 
for explaining the disparity of associations found in the real data, 
for example for LRI (Table 1) corresponding to a frequent event. 
Indeed, the characteristics of each method highlighted by the 
simulation study are found in the real case, i.e. bias using simple 
imputation, lack of power using multiple imputation, and 
significant association using Bayesian approach. The conclusion 
of a significant association between formaldehyde exposure and 
LRI is strengthened. With regards to the infrequent event, DNC, 
only a tendency of an association is observed. No approach gives 
completely satisfactory results when the health outcome is 
infrequent and the proportion of missing values is high. Though 
the Bayesian modelling has the best power and precision of 
estimates, this comes at a cost of inflated risk of type I error. 
However, estimated posterior distribution of OR would be helpful 
to refine the conclusion by introducing a new category of "almost 
significant association" when probability of OR less than 1 is 
between 5% and 10%. Concerning inflation of risk of type I error, 
correcting methodology as bootstrap could be implemented. This 
would lead certainly to very huge computer time as MCMC 
iterative algorithm would be used on each bootstrapped sample. 
An alternative approach to MCMC in repeated Bayesian 

References 

1 . Vach W, Blettner M (2005) Missing data in epidemiologic studies. Encyclopedia 
of Biostatistics: John Wiley & Sons. 

2. Jia C, D'SouzaJ, Batterman S (2008) Distributions of personal VOC exposures: 
a population-based analysis. Environ Int 34: 922-931. 

3. Fusco D, Forasticrc F, Michclozzi P, Spadca T, Ostro B, et al. (2001) Air 
pollution and hospital admissions for respiratory conditions in Rome, Italy. Eur 
RespirJ 17: 1143-1150. 

4. Rubin DB (2004) Multiple imputation for nonresponse in surveys: Wiley. 

5. Barzi F, Woodward M (2004) Imputations of missing values in practice: results 
from imputations of scrum cholesterol in 28 cohort studies. Am J Epidemiol 
160: 34-45. 

6. Schafer J (1997) Analysis of incomplete multivariate data. Monographs on 
statistics and applied probability. 

7. KlebanoffMA, Cole SR (2008) Use of multiple imputation in the epidemiologic 
literature. Am J Epidemiol 168: 355-357. 

8. Carrigan G, Barnett AG, Dobson AJ, Mishra G (2007) Compensating for 
Missing Data from Longitudinal Studies Using WinBUGS. J Stat Softw 19: 
1-17. 

9. Gryparis A, Paciorck CJ, Zcka A, Schwartz J, Coull BA (2009) Measurement 
error caused by spatial misalignment in environmental epidemiology. Biostatis- 
tics 10: 258-274. 

10. Gclman A, Carlin JB, Stern HS, Rubin DB (2003) Bayesian Data Analysis. 2nd 
Edition. CRC Press. 

11. Roda C, Kousignian I, Guihenneuc-Jouyaux C, Dassonville C, Nicolis I, ct al. 
(2011) Formaldehyde Exposure and Lower Respiratory Infections in Infants: 
Findings from the PARIS Cohort Study. Environ Health Perspect 119: 1653- 
1658. 



estimations such as "Importance Sampling" could be envisaged 
[34]. 

Supporting Information 

Figure SI Boxplots of fi (ft = In OR) estimates under 
Bayesian approach from 100 simulated datasets for the 
three different values of true OR dealing with no missing 
values, 75%, 85% and 95% of missing values. 

(PDF) 

Figure S2 Proportions of significant associations based 
on 100 replicates, for each approach dealing with 85%, 
and 75% of missing values. 

(PDF) 

Table SI Predictive factors of formaldehyde exposure 
and covariates used for adjustment in the model relating 
formaldehyde exposure and health indicator. 

(DOC) 

Table S2 Root mean square error of beta coefficients 
with 95% confidence interval based on 100 replicates 
with 85%, and 75% of missing values. 

(DOC) 

Table S3 Proportions of significant associations with 
95% confidence interval based on 100 replicates with 
95%, 85% and 75% of missing values when OR true = 1. 

(DOC) 

Acknowledgments 

The authors wish to thank the families for their participation and the 
administrative staff for their involvement in the PARIS study. The authors 
would like to thank the editor and the referees for helpful suggestions that 
lead to significant improvements. 

Author Contributions 

Conceived and designed the experiments: CR LM CG. Analyzed the data: 
CR IN CG. Contributed reagents/materials/analysis tools: CR IN IM 
CG. Wrote the paper: CR IN IM CG. 



12. Clarisse B, Nikasinovic L, Poinsard R, Just J, Momas I (2007) The Paris 
prospective birth cohort study: which design and who participates? Eur J Epi- 
demiol 22: 203-210. 

13. Roda C, Guihenneuc-Jouyaux C, Momas I (2013) Environmental triggers of 
nocturnal dry cough in infancy: New insights about chronic domestic exposure 
to formaldehyde in the PARIS birth cohort. Environ Res 123: 46—51. 

14. Dassonville C, Demattci C, Laurent AM, Le Moullec Y, Seta N, ct al. (2009) 
Assessment and predictor determination of indoor aldehyde levels in Paris 
newborn babies' homes. Indoor Air 19: 314—323. 

15. Rubin DB (1976) Inference and missing data. Biomctrika 63: 581. 

16. Tenenhaus M (1998) [La regression PLS: thcoric et pratique]: Teehnip. 

1 7. Gilks WR, Richardson S, Spiegelhalter DJ (1996) Markov Chain Monte Carlo in 
Practice. London, UK: Chapman & Hall. 

18. Carpenter J, Kcnward M (2005) Example Analyses Using WinBUGS 1.4. 
Available: http://www.missingdata.org.uk/. Accessed 2011 Feb 2. 

19. Bo TH, Dysvik B, Jonassen I (2004) LSimpute: accurate estimation of missing 
values in microarray data with least squares methods. Nucleic Acids Res 32: c34. 

20. Brock GN, Shaffer JR, Blakcslcy RE, Lotz MJ, Tseng GC (2008) Which missing 
value imputation method to use in expression profiles: a comparative study and 
two selection schemes. BMC Bioinformatics 9: 12. 

21. Kim H, Golub GH, Park H (2005) Missing value estimation for DNA 
microarray gene expression data: local least squares imputation. Bioinformatics 
21: 187-198. 

22. Little RJA, Rubin DB (2002) Statistical analysis with missing data, 2nd edition. 
New York: John Wiley. 

23. Burton A, Altman DG, Royston P, Holder RL (2006) The design of simulation 
studies in medical statistics. Stat Med 25: 4279-4292. 



PLOS ONE | www.plosone.org 



7 



September 2014 | Volume 9 | Issue 9 | e1 04254 



Missing Data in Environmental Epidemiology 



24. Cloppcr C, Pearson E (1934) The use of confidence or fiducial limits illustrated 
in the case of the binomial. Biometrika 26: 404—413. 

25. R Development Core Team (2011) R: A language and environment for 
statistical computing. In: R Foundation for Statistical Computing, editor. 
Vienna, Austria. 

26. Lunn DJ, Thomas A, Best N, Spiegelhalter D (2000) WinBUGS-A Bayesian 
modelling framework: Concepts, structure, and extensibility. Stat Comput 10: 
325-337. 

27. Scheffer J (2002) Dealing with Missing Data. Res Lett Inf Math Sei: 153-160. 

28. Lee KJ, CarlinJB (2012) Recovery of information from multiple imputation: a 
simulation study. Emerg Themes Epidemiol 9: 3. 

29. Rubin DB (1987) Multiple imputation for nonresponse in surveys. New York, 
NY: John Wiley and Sons, Inc. 



30. Steyerberg EW (2009) Clinical prediction models: A practical approach to 
development, validation, and updating: Springer. 

3 1 . Moons KG, Donders RA, Stijnen T, Harrell FEJr (2006) Using the outcome for 
imputation of missing predictor values was preferred. J Clin Epidemiol 59: 
1092-1101. 

32. Sterne JA, White IR, CarlinJB, Spratt M, Royston P, et al. (2009) Multiple 
imputation for missing data in epidemiological and clinical research: potential 
and pitfalls. BMJ 338: b2393. 

33. Bodncr TE (2008) What improves with increased missing data imputations? 
Struct Equ Modeling-a Multidisciplinary Journal 15: 651-675. 

34. Gajda D, Guihcnncuc-Jouyaux C, RousscauJ, Mengersen K, Nur D (2010) Use 
in practice of importance sampling for repeated MCMC for Poisson models. 
ElectronJ Stat 4: 361-383. 



PLOS ONE | www.plosone.org 



8 



September 2014 | Volume 9 | Issue 9 | e104254 



